Linear Time Clustering for High Dimensional Mixtures of Gaussian Clouds
Abstract
Clustering mixtures of Gaussian distributions is a fundamental and challenging problem that is ubiquitous in high-dimensional data processing tasks. While state-of-the-art work on learning Gaussian mixture models has focused primarily on improving separation bounds and generalizing them to arbitrary classes of mixture models, less attention has been paid to the practical computational efficiency of the proposed solutions. In this paper, we propose a novel and highly efficient clustering algorithm for n points drawn from a mixture of two arbitrary Gaussian distributions in R^p. The algorithm performs random 1-dimensional projections until it finds a direction that yields a user-specified clustering error e. For a 1-dimensional separation parameter γ satisfying γ = Q^{-1}(e), the expected number of such projections is shown to be bounded by o(ln p) when γ satisfies γ ≤ c√(ln ln p), where c is the separability parameter of the two Gaussians in R^p. Consequently, the expected overall running time of the algorithm is linear in n and quasi-linear in p, at o(ln p)·O(np), and the sample complexity is independent of p. This result stands in contrast to prior works, which provide running times that are polynomial, and at best quadratic, in p and n. We show that our bound on the expected number of 1-dimensional projections extends to the case of three or more Gaussian components, and we present a generalization of our results to mixture distributions beyond the Gaussian model.
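The projection loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes Q is the standard Gaussian tail function, uses a simple 1-D 2-means split, and accepts a direction when a heuristic score (gap between cluster means divided by the sum of cluster standard deviations) reaches γ. The function names, the score, and the `max_tries` cap are all hypothetical choices made for illustration; the score is a biased proxy for the true 1-D separation.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def gamma_from_error(e):
    """gamma = Q^{-1}(e), ASSUMING Q is the standard Gaussian tail function."""
    return NormalDist().inv_cdf(1.0 - e)


def random_unit_vector(p, rng):
    """Draw a direction uniformly on the unit sphere in R^p."""
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(points, direction):
    """Project every point onto one direction; costs O(n * p)."""
    return [sum(x * d for x, d in zip(pt, direction)) for pt in points]


def split_1d(values, iters=25):
    """1-D 2-means split. Returns (labels, heuristic separation score)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0] * len(values), 0.0
    thr = 0.5 * (lo + hi)
    left, right = values, []
    for _ in range(iters):
        thr = 0.5 * (lo + hi)
        left = [v for v in values if v <= thr]
        right = [v for v in values if v > thr]
        if not left or not right:
            return [0] * len(values), 0.0
        lo, hi = fmean(left), fmean(right)
    labels = [0 if v <= thr else 1 for v in values]
    spread = pstdev(left) + pstdev(right)
    # Hypothetical score: gap between the means in units of total spread.
    score = (hi - lo) / spread if spread > 0 else math.inf
    return labels, score


def cluster_by_random_projections(points, e, max_tries=1000, seed=0):
    """Try random directions until the score reaches gamma = Q^{-1}(e).

    Returns (labels, score, tries) for the best direction seen.
    """
    rng = random.Random(seed)
    gamma = gamma_from_error(e)
    p = len(points[0])
    best = None
    for tries in range(1, max_tries + 1):
        direction = random_unit_vector(p, rng)
        labels, score = split_1d(project(points, direction))
        if best is None or score > best[1]:
            best = (labels, score, tries)
        if score >= gamma:
            break
    return best
```

Each iteration touches every coordinate of every point once, so a run with T projections costs O(T·np), which is the shape of the running-time bound quoted above. Calling `cluster_by_random_projections(points, e=0.01)` on a list of equal-length coordinate lists returns the labels, the score of the accepted direction, and the number of projections tried.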
Related resources
Efficient Sparse Clustering of High-Dimensional Non-spherical Gaussian Mixtures
We consider the problem of clustering data points in high dimensions, i.e., when the number of data points may be much smaller than the number of dimensions. Specifically, we consider a Gaussian mixture model (GMM) with two non-spherical Gaussian components, where the clusters are distinguished by only a few relevant dimensions. The method we propose is a combination of a recent approach for le...
Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture...
High-Dimensional Unsupervised Active Learning Method
In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...
Adaptive Mixtures of Factor Analyzers
A mixture of factor analyzers is a semi-parametric density estimator that generalizes the well-known mixtures of Gaussians model by allowing each Gaussian in the mixture to be represented in a different lower-dimensional manifold. This paper presents a robust and parsimonious model selection algorithm for training a mixture of factor analyzers, carrying out simultaneous clustering and locally l...
A New Algorithm in Blind Source Separation for High Dimensional Data Sets Such as Meg Data
BSS is one of the well-known methods of signal processing. It recovers the original sources from observed mixtures without any further information about the mixing system or the original sources. In many applications, mixtures are combinations of non-Gaussian and time-correlated components. The MCOMBI algorithm is a known method for separating these kinds of sources. The perform...
Journal title:
- CoRR
Volume: abs/1712.07242, Issue: -
Pages: -
Publication date: 2017